Milestone 2

Description of dataset

What is the data source? (1-2 sentences on where the data is coming from, dates included, etc.)

The monkeypox case data is from European Centre for Disease Prevention and Control. The population denominator, world region, and census data are from Eurostat.

How does the dataset relate to the group problem statement and question?

Problem statement 1: Is there a difference between monkeypox cases by region in the EU? We will be using the monkeypox data, population denominator and region data sets to help us find the difference between monkeypox cases by region in the EU.

Problem statement 2: Is there a difference between country level monkeypox case rates by certain demographic factors? We will use the monkeypox case, population denominator and census data to see if there is a difference between country level case rates based on certain demographic factors.

Import statement

library(readr)
library(dplyr)
library(tidyverse)
library(stringr)
library(lubridate)
library(knitr)

Use appropriate import function and package based on the type of file

mp_cases <-read_csv("~/r_for_ph_monkeypox/euro_mpx_cases.csv")
pop_den <- read_csv("~/r_for_ph_monkeypox/euro_pop_denominators.csv")
eu_census <- read_csv("euro_census_stats.csv")
wrld_region <- read_csv("~/r_for_ph_monkeypox/world_country_regions.csv")

Utilize function arguments to control relevant components (i.e. change column types, column names, missing values, etc.)

clean_eu_census <- rename_with(eu_census, ~ tolower((gsub(" ", "_",.x,fixed=TRUE)))) %>%
  select(c(-"flags",-"footnotes",-"edu"))
clean_mp_cases <- rename_with(mp_cases, ~ tolower((gsub(" ", "_",.x,fixed=TRUE)))) %>%
  select(c(-"source"))
clean_pop_den <- rename_with(pop_den, ~ tolower((gsub(" ", "_",.x,fixed=TRUE)))) %>%
  select(c(-"dataflow", -"last_update", -"freq", -"obs_flag"))
clean_wrld_region <- filter(wrld_region, region == "Europe") %>%
  select(c(-"alpha-3", -"iso_3166-2", -"intermediate-region"))

Document the import process 1. We navigated to the PHW290/phw251_projectdata git repo. We clicked on Code and downloaded the Zip file which we then extracted the csv files from. We imported the relevant datasets into our RStudio files for this project.

Identify data types for 5+ data elements/columns/variables

Identify 5+ data elements required for your specified scenario. If <5 elements are required to complete the analysis, please choose additional variables of interest in the data set to explore in this milestone.

daterep = date confcases = numeric sub_region =character age = character edu = character countrycode = character county_code = character geo = character

Identify the desired type/format for each variable—will you need to convert any columns to numeric or another type? We would have desired for the age variable to be numeric.

Utilize functions or resources in RStudio to determine the types of each data element (i.e. character, numeric, factor)

str(clean_pop_den)
## tibble [603 × 4] (S3: tbl_df/tbl/data.frame)
##  $ indic_de   : chr [1:603] "JAN" "JAN" "JAN" "JAN" ...
##  $ geo        : chr [1:603] "AD" "AD" "AD" "AD" ...
##  $ time_period: num [1:603] 2011 2012 2013 2016 2018 ...
##  $ obs_value  : num [1:603] 78115 78115 76246 71732 74794 ...
str(clean_eu_census)
## tibble [152,534 × 7] (S3: tbl_df/tbl/data.frame)
##  $ country_code: chr [1:152534] "AT" "AT" "AT" "AT" ...
##  $ sex         : chr [1:152534] "F" "F" "F" "F" ...
##  $ age         : chr [1:152534] "Y_GE85" "Y_GE85" "Y_GE85" "Y_GE85" ...
##  $ cas         : chr [1:152534] "ACT" "ACT" "ACT" "ACT" ...
##  $ time        : num [1:152534] 2011 2011 2011 2011 2011 ...
##  $ res_pop     : chr [1:152534] "500000-999999" "10000-99999" "200000-499999" "100000-199999" ...
##  $ pop         : num [1:152534] 0 4 5 6 6 8 18 19 21 25 ...
str(clean_mp_cases)
## tibble [2,987 × 4] (S3: tbl_df/tbl/data.frame)
##  $ daterep    : Date[1:2987], format: "2022-05-09" "2022-05-09" ...
##  $ countryexp : chr [1:2987] "Austria" "Belgium" "Bulgaria" "Croatia" ...
##  $ countrycode: chr [1:2987] "AT" "BE" "BG" "HR" ...
##  $ confcases  : num [1:2987] 0 0 0 0 0 0 0 0 0 0 ...
str(clean_wrld_region)
## tibble [51 × 7] (S3: tbl_df/tbl/data.frame)
##  $ name           : chr [1:51] "ALAND ISLANDS" "ALBANIA" "ANDORRA" "AUSTRIA" ...
##  $ alpha-2        : chr [1:51] "ax-248" "al-8" "ad-20" "at-40" ...
##  $ country-code   : num [1:51] 248 8 20 40 112 56 70 100 191 203 ...
##  $ region         : chr [1:51] "Europe" "Europe" "Europe" "Europe" ...
##  $ sub-region     : chr [1:51] "Northern Europe" "Southern Europe" "Southern Europe" "Western Europe" ...
##  $ region-code    : num [1:51] 150 150 150 150 150 150 150 150 150 150 ...
##  $ sub-region-code: num [1:51] 154 39 39 155 151 155 39 151 39 151 ...

Provide a basic description of the 5+ data elements

Numeric: mean, median, range Character: unique values/categories Or any other descriptives that will be useful to the analysis

We will use the daterep column from the clean_mp_cases data set to establish the number of confirmed cases per week or per month. daterep lists the date that the cases were confirmed on.

We used the summary function to give us a summary of monkeypox cases in the entire dataset

summary(clean_mp_cases$confcases)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   5.715   1.000 655.000

We will the sub-region column to categorize the different EU countries by region.

We will use the eu census dataset to look at the age and education demographic(columns “age” and “edu”) and their relationship to monkeypox cases.

We will use the countrycode/geo/country_code columns which are listed in all datasets to join our datasets as needed in the future milestone(s).

Milestone 3

Subset rows or columns,create new varbiables and clean

unique(clean_mp_cases$countrycode)
##  [1] "AT" "BE" "BG" "HR" "CY" "CZ" "DK" "EE" "FI" "FR" "DE" "EL" "HU" "IS" "IE"
## [16] "IT" "LV" "LT" "LU" "MT" "NL" "NO" "PL" "PT" "RO" "SK" "SI" "ES" "SE"
unique(clean_pop_den$geo)
##  [1] "AD"        "AL"        "AM"        "AT"        "AZ"        "BA"       
##  [7] "BE"        "BG"        "BY"        "CH"        "CY"        "CZ"       
## [13] "DE"        "DK"        "EA18"      "EA19"      "EE"        "EL"       
## [19] "ES"        "EU27_2020" "EU28"      "FI"        "FR"        "FX"       
## [25] "GE"        "HR"        "HU"        "IE"        "IS"        "IT"       
## [31] "LI"        "LT"        "LU"        "LV"        "MC"        "MD"       
## [37] "ME"        "MK"        "MT"        "NL"        "NO"        "PL"       
## [43] "PT"        "RO"        "RS"        "RU"        "SE"        "SI"       
## [49] "SK"        "SM"        "TR"        "UA"        "UK"        "XK"
pop_den_geo <- filter(clean_pop_den,geo %in% c("AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", 
                                               "FI", "FR","DE", "EL", "HU","IS", "IE" ,"IT", "LV", 
                                               "LT","LU", "MT", "NL", "NO","PL", "PT", "RO", "SK",
                                               "SI", "ES", "SE"),time_period == 2022) %>%
                        select(-indic_de)
date_mp_cases <- clean_mp_cases %>%
   group_by(countrycode, countryexp, month = floor_date(daterep, 'month')) %>%
    summarize(monthly_total_cases = sum(confcases, na.rm = T))
join_df <- left_join(pop_den_geo, date_mp_cases, by = c("geo" = "countrycode"))
join_df$countryexp <- str_to_upper(join_df$countryexp)
regions_df <-clean_wrld_region %>%
  select(name, `sub-region`)
join_region_mp <- inner_join(regions_df, join_df, by = c("name" = "countryexp"))%>%
  mutate(rate = round(monthly_total_cases/obs_value * 100000,3))
rate_region_mp <- join_region_mp %>%
  group_by(`sub-region`,month)%>%
  summarize(sum_rate = sum(rate))%>%
  ungroup()
ggplot(rate_region_mp, aes(x=month, y=sum_rate)) + 
  geom_bar(stat = "identity")+
  facet_wrap(~ `sub-region`) +
  labs(title = "Monthly Monkey rates by EU region ")

sum_obs_value <- join_region_mp %>%
  group_by(`sub-region`)%>%
  summarize(sum_obs = sum(obs_value))

Data dictionary based on clean dataset (minimum 4 data elements), including:

Data Dictionary

Dataset name:join_region_mp

  • Name :month

  • Data type: DATE

  • Description: A grouping of the of all the dates in the month and is listed as for example “2022-05-01”. It includes all the days in that respective month.

  • Name: monthly_total_cases

  • Data type: numeric

  • Description: The sum of monthly monkeypox cases per country.

  • Name: rate

  • Data type: numeric

  • Description: The rate is the number of monthly total cases divided by the obs_value per 100000, rounded to 3 decimal places.

Dataset name: rate_region_mp

  • Name: sum_rate
  • Data type: numeric
  • Description: The sum monthly rate of monkeypox cases in each sub-region.

Table

summary(join_region_mp$monthly_total_cases)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     2.0    13.5   152.4    59.5  3244.0
summary(join_region_mp$rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.04775 0.24450 0.68122 0.86450 6.83900
summary(rate_region_mp$sum_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0440  0.5703  3.3385  4.7686  7.3628 15.3850
summary(sum_obs_value$sum_obs)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 154996244 306264194 446101430 451725616 591562852 759703360
library(kableExtra)
df_ds <- data.frame("Median" = c(13.5, 0.24450, 3.3385,446101430),
                    "Mean" = c(152.4, 0.68122, 4.7686, 451725616),
                    row.names = c("Monthly total cases", "Rate", 
                                  "Summary rate", "Obs value"))
kable(df_ds,booktabs=T, digits= c(1,1,1,0),
 caption = "Descriptive statistics for data elements")
Descriptive statistics for data elements
Median Mean
Monthly total cases 13.5 152.4
Rate 0.2 0.7
Summary rate 3.3 4.8
Obs value 446101430.0 451725616.0

Milestone 4

rate_region_mp <- join_region_mp %>%
  group_by(`sub-region`,month)%>%
  summarize(sum_rate = sum(rate))%>%
  ungroup()

ggplot(rate_region_mp, aes(x=month, y=sum_rate)) + 
  geom_bar(stat = "identity",fill = "red")+
  facet_wrap(~ `sub-region`) +
  labs(title = "Monthly Monkey Rates by EU region ",x= "Month", y ="Case rates")

This graph shows the case rates in Europe by region in the months May to August 2022. In the month of July, monkeypox case rates were highest across all regions in Europe. In the month of May, monkeypox case rates were lowest across all regions in Europe.

pivot_rate_region_mp <- rate_region_mp %>%
  rename(Month = "month")%>%
  pivot_wider(names_from = `sub-region`, values_from = sum_rate)

library(formattable)
table_case_region <- formattable(pivot_rate_region_mp, align=c("l","c","c","c","r"),
            list("Western Europe"= (color_tile("white", "orange"))))

The table shows the case rates in different European regions between the months of May and August 2022. Wester Europe had the highest case rates in the month of July.

eu_cens_demo <- clean_eu_census %>% 
  mutate(res_size_cat = case_when(res_pop %in% c("0-1000", "1000-9999", "10000-99999") ~ "small",
                                  res_pop %in% c("100000-199999","200000-499999") ~ "medium",
                                  res_pop %in% c("500000-999999", "GE1000000") ~ "large",
                                   TRUE ~ "unknown"))%>%
  group_by(country_code, res_size_cat) %>%
  summarize(group_pop = sum(pop, na.rm = T)) %>%
  mutate(country_pop = sum(group_pop, na.rm = T),
         group_pct = group_pop/ country_pop) %>%
  ungroup() %>%
  filter(res_size_cat == "small") %>%
mutate(pct_small = case_when(group_pct >= 0.80 ~ "Greater than 80% in Small Cities",
                               group_pct >= 0.65 ~ "Between 65% and 80% in Small Cities",
                               group_pct < 0.65 ~ "Less than 65% in Small Cities"))

res_size <- left_join(eu_cens_demo, join_region_mp, by = c("country_code" = "geo")) %>%
  group_by(country_code, pct_small, month, `sub-region`) %>%
  summarize(cases_by_monthcat = sum(monthly_total_cases, na.rm = T),
            cat_pop = max(country_pop, na.rm = T)) %>%
  mutate(size_cat_rate = (cases_by_monthcat / cat_pop) * 1000000) %>%
  ungroup()


library(plotly)



plot_ly(res_size, 
        x = ~month, 
        y = ~size_cat_rate, 
        type = "bar", 
        linetype = ~pct_small) %>%
  layout(title = "MPX Case Rates in Europe by Percent of Population Living in Small Cities, May-August 2022", xaxis = list(title = "Month", dtick = "M1", ticketformat = "%b<br>%Y"), 
         yaxis = list(title = "Case Rate per 1000000"))

This graph shows the case rates in Europe by the percent of the population living in small cities between the months of May to August 2022. The monthly monkeypox case rates between May 2022 and August 2022 were highest in countries where greater than 80% of the population lived in small cities.

eu_cens_eco <- clean_eu_census %>% 
  mutate(eco_pop = case_when(cas %in% c("EMP") ~ "employed",
                                  cas %in% c("UNE") ~ "unemployed"))%>%
  group_by(country_code, eco_pop) %>%
  summarize(group_eco_pop = sum(pop, na.rm = T)) %>%
  mutate(count_eco_pop = sum(group_eco_pop, na.rm = T),
         group_eco_pct = group_eco_pop/ count_eco_pop) %>%
  ungroup() %>%
   filter(eco_pop == "employed") %>%
 mutate(pct_employ = case_when(group_eco_pct >= 0.30 ~ "Greater than 30% is Employed",
                                group_eco_pct >= 0.25 ~ "Between 25% and 30% is Employed",
                                group_eco_pct < 0.25 ~ "Less than 25% is Employed"))
 
 emp_mp <- inner_join(eu_cens_eco, join_region_mp, by = c("country_code" = "geo")) %>%
   group_by(country_code, pct_employ, month, `sub-region`) %>%
   summarize(cases_by_monthcat = sum(monthly_total_cases, na.rm = T),
             employ_pop = max(count_eco_pop, na.rm = T)) %>%
   mutate(size_emp_rate = (cases_by_monthcat /employ_pop) * 1000000) %>%
   ungroup()
 
   plot_ly(
   emp_mp,
   x= ~month,
   y= ~size_emp_rate,
   linetype = ~pct_employ,
   type = "bar") %>%
   layout(title = "MPX Case Rates in Europe by Percent Employed, May-August 2022",
  xaxis = list(title = "Month"),
  yaxis = list(title = "Case Rate per 1000000"))

This graph shows the case rates in Europe by the percent of emplyed population between the months of May to August 2022.The monthly monkeypox case rates between May 2022 and August 2022 were highest in countries where between 25% and 30% of the population are employed.